
    Planting a SEED of Vision in Large Language Models

    We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the emergent ability to SEE and Draw at the same time. Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.) or generation (compared to Stable Diffusion, etc.). Despite these limitations, we remain confident in their natural capacity to unify visual and textual representations, facilitating scalable multimodal training with the LLM's original recipe. In this study, we identify two crucial principles for the architecture and training of SEED that effectively ease subsequent alignment with LLMs. (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens should capture high-level semantics consistent with the degree of semantic abstraction in words, and be optimized for both discriminativeness and reconstruction during the tokenizer training phase. As a result, the off-the-shelf LLM is able to perform both image-to-text and text-to-image generation by incorporating our SEED through efficient LoRA tuning. Comprehensive multimodal pretraining and instruction tuning, which may yield improved results, are reserved for future investigation. This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs. Our preliminary study emphasizes the great potential of discrete visual tokens in versatile multimodal LLMs and the importance of proper image tokenizers in broader research.
    Comment: Technical Report; Project released at: https://github.com/AILab-CVC/SEE
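
    The 1D causal token design can be pictured with a short sketch. The module below is only a minimal, hypothetical illustration of the two stated principles (causal 1D queries over 2D patch features, followed by codebook quantization); the class name, layer counts, and sizes are assumptions and not the released SEED architecture.

    import torch
    import torch.nn as nn

    class CausalImageTokenizer(nn.Module):
        """Hypothetical sketch: 1D causal queries over 2D patch features, then vector quantization."""
        def __init__(self, num_tokens=32, dim=768, codebook_size=8192):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_tokens, dim))
            layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=4)
            self.codebook = nn.Embedding(codebook_size, dim)

        def forward(self, patch_feats):  # patch_feats: (B, N_patches, dim) from a ViT encoder
            B, T = patch_feats.size(0), self.queries.size(0)
            q = self.queries.unsqueeze(0).expand(B, -1, -1)
            causal = nn.Transformer.generate_square_subsequent_mask(T)
            # queries attend to image patches but only see earlier queries (1D causal dependency)
            h = self.decoder(q, patch_feats, tgt_mask=causal)
            dists = torch.cdist(h, self.codebook.weight.unsqueeze(0).expand(B, -1, -1))
            ids = dists.argmin(dim=-1)  # discrete ids an LLM can consume like word tokens
            return ids, self.codebook(ids)

    tokenizer = CausalImageTokenizer()
    ids, embeds = tokenizer(torch.randn(2, 196, 768))
    print(ids.shape)  # torch.Size([2, 32])

    An off-the-shelf LLM would then be extended with codebook_size extra token embeddings and tuned (e.g., via LoRA) to predict these ids autoregressively, alongside ordinary text tokens.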

    Contrastive Masked Autoencoders for Self-Supervised Video Hashing

    Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision, facilitating efficient large-scale video retrieval and attracting increasing research attention. The success of SSVH lies in the understanding of video content and the ability to capture the semantic relation among unlabeled videos. Typically, state-of-the-art SSVH methods consider these two points in a two-stage training pipeline, where they first train an auxiliary network with instance-wise mask-and-predict tasks and then train a hashing model to preserve the pseudo-neighborhood structure transferred from the auxiliary network. This consecutive training strategy is inflexible and also unnecessary. In this paper, we propose a simple yet effective one-stage SSVH method called ConMH, which incorporates video semantic information and video similarity relationship understanding in a single stage. To capture video semantic information for better hashing learning, we adopt an encoder-decoder structure to reconstruct the video from its temporally masked frames. In particular, we find that a higher masking ratio helps video understanding. Besides, we fully exploit the similarity relationship between videos by maximizing agreement between two augmented views of a video, which contributes to more discriminative and robust hash codes. Extensive experiments on three large-scale video datasets (i.e., FCVID, ActivityNet and YFCC) indicate that ConMH achieves state-of-the-art results. Code is available at https://github.com/huangmozhi9527/ConMH.
    Comment: This work is accepted by the AAAI 2023. 9 pages, 6 figures, 6 table
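
    A minimal sketch of the one-stage idea follows, assuming a Transformer encoder over pre-extracted frame features; the masking scheme, loss weighting, and module names are illustrative and not taken from the released ConMH code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConMHSketch(nn.Module):
        def __init__(self, dim=512, hash_bits=64, mask_ratio=0.75):
            super().__init__()
            enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
            self.decoder = nn.Linear(dim, dim)          # lightweight reconstruction head
            self.hash_head = nn.Linear(dim, hash_bits)  # sign() at inference gives binary codes
            self.mask_ratio = mask_ratio

        def encode(self, frames):                        # frames: (B, T, dim) frame features
            keep = int(frames.size(1) * (1 - self.mask_ratio))
            idx = torch.randperm(frames.size(1))[:keep]  # high masking ratio: keep only a few frames
            return self.encoder(frames[:, idx]).mean(dim=1)

        def forward(self, view1, view2):                 # two augmented views of the same videos
            z1, z2 = self.encode(view1), self.encode(view2)
            recon_loss = F.mse_loss(self.decoder(z1), view1.mean(dim=1))  # reconstruction proxy
            h1 = F.normalize(self.hash_head(z1), dim=-1)
            h2 = F.normalize(self.hash_head(z2), dim=-1)
            logits = h1 @ h2.t() / 0.07                  # contrastive agreement between views
            contrast = F.cross_entropy(logits, torch.arange(h1.size(0)))
            return recon_loss + contrast

    model = ConMHSketch()
    loss = model(torch.randn(4, 32, 512), torch.randn(4, 32, 512))
    print(loss.item())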

    Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

    Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We argue that this is because they capture only pixel-level knowledge rather than spatiotemporal commonsense, which is far from cognition-level video understanding. Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step toward exploiting language semantics to boost transferable spatiotemporal representation learning. We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR scripts by attending to learned video representations. We do not rely on descriptive captions and learn purely from video, i.e., leveraging naturally transcribed speech to provide noisy but useful semantics over time. Furthermore, rather than the simple concept learning of vision-caption contrast, we encourage cognition-level temporal commonsense reasoning via narrative reorganization. These advantages enable our model to contextualize what is happening, much as humans do, and to apply seamlessly to large-scale uncurated video data in the real world. Note that our method differs from approaches designed for video-text alignment (e.g., Frozen) and multimodal representation learning (e.g., Merlot). Our method demonstrates strong out-of-the-box spatiotemporal representations on diverse video benchmarks, e.g., +13.6% gains over VideoMAE on SSV2 via linear probing.
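
    The pretext task can be pictured with a small sketch: shuffled transcript-segment embeddings cross-attend to video features, and the model classifies each segment's original position. The names, dimensions, and random placeholder features below are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TranscriptSorter(nn.Module):
        def __init__(self, dim=512, num_segments=8):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            self.order_head = nn.Linear(dim, num_segments)   # classify each segment's original slot

        def forward(self, text_segments, video_feats):        # (B, S, dim), (B, T, dim)
            attended, _ = self.cross_attn(text_segments, video_feats, video_feats)
            return self.order_head(attended)                   # (B, S, S) position logits

    B, S, T, dim = 2, 8, 32, 512
    order = torch.stack([torch.randperm(S) for _ in range(B)])   # ground-truth original positions
    segments = torch.randn(B, S, dim)                             # embeddings of shuffled ASR segments
    sorter = TranscriptSorter(dim, S)
    logits = sorter(segments, torch.randn(B, T, dim))             # placeholder video features
    loss = F.cross_entropy(logits.reshape(B * S, S), order.reshape(B * S))
    print(loss.item())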

    The final step effect

    Suppose you need to complete a task of 5 steps, each with equal difficulty and pass rate. You have a privilege that guarantees you will pass one of the steps, but you must decide which step to privilege before you start the task. Which step do you want to privilege? Mathematically, the effect of each step on the final outcome is identical, so there seems to be no prima facie reason for a preference. Five studies were conducted to explore this issue. In Study 1, participants could place the privilege on any of steps 1–5; they were most inclined to privilege step 5. In Study 2, participants had to pay money to purchase the privilege for steps 1–5, respectively; they would pay the most for step 5. Study 3 directly reminded participants that the probability of success for the whole task is mathematically the same no matter which step is privileged, but most participants still preferred to privilege the final step. In Study 4, the outcomes of all steps were not announced until every step was finished, and participants were asked how painful it would be to pass all steps but one; they expected failing at the final step to be the most painful. In Study 5, an implicit association test showed that people associated the first step with easy and the final step with hard. These results demonstrate the final step effect and suggest that both anticipated painfulness and stereotype may play a role in this phenomenon.
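
    For concreteness, the mathematical point the participants push against can be checked directly: with a per-step pass rate p, guaranteeing any single step leaves the same overall success probability p^4. The value p = 0.6 below is an arbitrary illustrative choice.

    p = 0.6  # illustrative per-step pass rate
    for privileged in range(1, 6):
        prob = 1.0
        for step in range(1, 6):
            prob *= 1.0 if step == privileged else p
        print(f"privilege step {privileged}: P(success) = {prob:.4f}")  # always p**4 = 0.1296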

    MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation

    The goal of sequential recommendation (SR) is to predict the items a user is potentially interested in based on her/his historical interaction sequences. Most existing sequential recommenders are developed based on ID features, which, despite their widespread use, often underperform with sparse IDs and struggle with the cold-start problem. Besides, inconsistent ID mappings hinder the model's transferability, isolating similar recommendation domains that could have been co-optimized. This paper aims to address these issues by exploring the potential of multi-modal information in learning robust and generalizable sequence representations. We propose MISSRec, a multi-modal pre-training and transfer learning framework for SR. On the user side, we design a Transformer-based encoder-decoder model, where the contextual encoder learns to capture sequence-level multi-modal synergy while a novel interest-aware decoder is developed to grasp item-modality-interest relations for better sequence representation. On the candidate item side, we adopt a dynamic fusion module to produce user-adaptive item representations, providing more precise matching between users and items. We pre-train the model with contrastive learning objectives and fine-tune it in an efficient manner. Extensive experiments demonstrate the effectiveness and flexibility of MISSRec, promising a practical solution for real-world recommendation scenarios.
    Comment: Accepted to ACM MM 202
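
    As a rough sketch of what user-adaptive item representation via dynamic fusion could look like, the module below predicts per-modality weights from the user's sequence representation and fuses candidate-item modality features accordingly; it is an assumption for illustration, not the released MISSRec module.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DynamicFusion(nn.Module):
        def __init__(self, dim=256, num_modalities=2):
            super().__init__()
            self.gate = nn.Linear(dim, num_modalities)   # user state -> modality weights

        def forward(self, user_repr, item_modal_feats):
            # user_repr: (B, dim); item_modal_feats: (num_items, num_modalities, dim), e.g. text and image
            w = F.softmax(self.gate(user_repr), dim=-1)                   # (B, M) user-specific weights
            fused = torch.einsum('bm,imd->bid', w, item_modal_feats)      # (B, num_items, dim)
            scores = torch.einsum('bd,bid->bi', user_repr, fused)         # user-item matching scores
            return scores

    fusion = DynamicFusion()
    scores = fusion(torch.randn(4, 256), torch.randn(100, 2, 256))
    print(scores.shape)  # torch.Size([4, 100])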

    An MEC Architecture-Oriented Improved RRT Algorithm for Regional Trajectory Planning

    Multi-access Edge Computing (MEC), which provides real-time computing capability, is considered an effective approach to improving the performance of Vehicular Ad Hoc Networks (VANETs). MEC can process regional vehicle information and generate real-time road hazard features, which can then be used in the trajectory planning process of vehicles. In this paper, an MEC-oriented VANET infrastructure is presented, and a road hazard feature-based trajectory planning method is proposed. A Back Propagation (BP) neural network is employed to predict changes in road hazard features, and a hazard-based cost function is defined. An improved Rapidly-exploring Random Tree (RRT) algorithm is then proposed for regional trajectory planning. A joint simulation is conducted on the SUMO and NS3 platforms, and the results verify the effectiveness and stability of the proposed algorithm.
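
    A toy sketch of hazard-aware RRT extension is given below: candidate nodes whose hazard cost exceeds a threshold are rejected during tree growth. The inverse-distance hazard field stands in for the BP network's predicted road hazard features, and all parameters are illustrative assumptions rather than the paper's algorithm.

    import math, random

    def hazard_cost(p, hazards):
        # toy hazard field: inverse distance to known hazard points
        return sum(1.0 / (0.1 + math.dist(p, h)) for h in hazards)

    def rrt(start, goal, hazards, step=0.5, max_cost=3.0, iters=2000):
        nodes = {start: None}                          # node -> parent
        for _ in range(iters):
            sample = goal if random.random() < 0.1 else (random.uniform(0, 10), random.uniform(0, 10))
            nearest = min(nodes, key=lambda n: math.dist(n, sample))
            d = math.dist(nearest, sample)
            new = tuple(n + step * (s - n) / d for n, s in zip(nearest, sample)) if d > step else sample
            if new in nodes or hazard_cost(new, hazards) > max_cost:
                continue                               # hazard-aware rejection of risky extensions
            nodes[new] = nearest
            if math.dist(new, goal) < step:            # goal reached: backtrack through parents
                path = [new]
                while nodes[path[-1]] is not None:
                    path.append(nodes[path[-1]])
                return path[::-1]
        return None

    print(rrt((0.0, 0.0), (9.0, 9.0), hazards=[(5.0, 5.0)]))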

    Coexistence of Antibiotic Resistance Genes and Virulence Factors Deciphered by Large-Scale Complete Genome Analysis

    Widespread use of antibiotics has enhanced the evolution of highly resilient pathogens and poses a severe risk to human health via coselection of antibiotic resistance genes (ARGs) and virulence factors (VFs). In this study, we rigorously evaluate the abundance relationship and physical linkage between ARGs and VFs by performing a comprehensive analysis of 9,070 bacterial genomes isolated from multiple species and hosts. The coexistence of ARGs and VFs was observed in bacteria across distinct phyla, pathogenicities, and habitats, especially among human-associated pathogens. The coexistence patterns of gene elements in different habitats and pathogenicity groups were similar, presumably due to frequent gene transfer. A shorter intergenic distance between mobile genetic elements and ARGs/VFs was detected in human/animal-associated bacteria, indicating a higher transfer potential. Increased accumulation of exogenous ARGs/VFs in human pathogens highlights the importance of gene acquisition in the evolution of human commensal bacteria. Overall, the findings provide insights into the genic features of ARG-VF combinations and expand our understanding of ARG-VF coexistence in bacteria.
    IMPORTANCE: Antibiotic resistance has become a serious global health concern. Despite numerous case studies, a comprehensive analysis of ARG and VF coexistence in bacteria is lacking. In this study, we explore the coexistence profiles of ARGs and VFs in diverse categories of bacteria by using a high-resolution bioinformatics approach. We also provide compelling evidence of unique ARG-VF gene pairs coexisting in specific bacterial genomes and reveal the potential risk associated with the coexistence of ARGs and VFs in organisms in both clinical settings and the environment.
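
    To make the notion of physical linkage concrete, the snippet below screens annotated genes for ARG-VF pairs on the same contig whose intergenic distance falls under a cutoff; the 10 kb threshold, record format, and example genes are illustrative assumptions, not the study's actual pipeline.

    # Gene records are (contig, start, end, category) tuples.
    def coexisting_pairs(genes, max_gap=10_000):
        args = [g for g in genes if g[3] == "ARG"]
        vfs = [g for g in genes if g[3] == "VF"]
        pairs = []
        for contig_a, s_a, e_a, _ in args:
            for contig_v, s_v, e_v, _ in vfs:
                if contig_a != contig_v:
                    continue                                   # only same-replicon pairs count as linked
                gap = max(0, max(s_a, s_v) - min(e_a, e_v))    # intergenic distance; 0 if genes overlap
                if gap <= max_gap:
                    pairs.append((contig_a, gap))
        return pairs

    genes = [
        ("contig1", 1_000, 2_200, "ARG"),   # hypothetical resistance gene
        ("contig1", 5_000, 6_500, "VF"),    # hypothetical virulence gene
        ("contig2", 100, 900, "VF"),
    ]
    print(coexisting_pairs(genes))           # [('contig1', 2800)]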